K | # of bigrams | # of trigrams | # of 4-grams | # of 5-grams | # of 6-grams |
---|---|---|---|---|---|
100 | 65 | 85 | 90 | 97 | 99 |
1000 | 318 | 610 | 782 | 910 | 972 |
10000 | 846 | 2569 | 4586 | 6483 | 8014 |
100000 | 1914 | 9223 | 23530 | 39171 | 55726 |
1000000 | 3580 | 20696 | 63567 | 118881 | 178863 |
How many different letter-N-grams do we find at the beginning of a word? Of course we will find many unexpected N-grams, but they will have low frequency. For this reason we count these numbers for different frequency ranges, using the top K=10^n words (n=2, 3, 4, 5, 6).
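The counting itself can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the table: the function name `count_initial_ngrams` and the toy word list are hypothetical, and the list is assumed to be sorted by descending frequency.

```python
def count_initial_ngrams(words, k, n_values=(2, 3, 4, 5, 6)):
    """Count distinct word-initial letter-N-grams among the k most frequent words.

    Assumption: `words` is already sorted by descending frequency.
    """
    top = words[:k]
    # w[:n] returns the whole word when it is shorter than n letters,
    # which matches the behaviour of SQL left(word, n).
    return {n: len({w[:n] for w in top}) for n in n_values}


# Toy word list for illustration only (not corpus data).
toy = ["the", "that", "this", "and", "an", "then", "there", "of"]
counts = count_initial_ngrams(toy, k=8)
print(counts)
```

Words shorter than N letters are counted with their full form, so for large N the count approaches the number of words K.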
For a better understanding we plot a diagram with both axes in logarithmic scale.
The numbers in the table reflect the variability in word formation and are expected to vary strongly between different languages. As a measure of this variability we can take the slope of the nearly linear part of the graphs.
The slope usually changes in the right part of the diagram because the total number of words is limited by corpus size. Larger corpora will give better results.
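A minimal sketch of such a slope estimate, using the bigram and 6-gram columns of the table above and an ordinary least-squares fit on the logarithms. The helper `loglog_slope` is a hypothetical name, and the fitting range K=1000 to K=100000 is an arbitrary assumption for illustration, not a recommended choice.

```python
import math

# Values copied from the table above.
K = [100, 1000, 10000, 100000, 1000000]
bigrams = [65, 318, 846, 1914, 3580]
sixgrams = [99, 972, 8014, 55726, 178863]


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) plotted against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Assumed fitting range: K = 1000 .. 100000 (list indices 1 to 3).
slope_bigrams = loglog_slope(K[1:4], bigrams[1:4])
slope_sixgrams = loglog_slope(K[1:4], sixgrams[1:4])
print(slope_bigrams, slope_sixgrams)
```

Changing the slice boundaries shows how sensitive the estimate is to the chosen range.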
For K=1000:
select 1000, count(distinct left(word,2)) as n2, count(distinct left(word,3)) as n3, count(distinct left(word,4)) as n4, count(distinct left(word,5)) as n5, count(distinct left(word,6)) as n6 from words where w_id>100 and w_id<=1100;
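The same query can be generated for every K in the table. The following sketch builds the SQL strings only and does not run them. The function name `ngram_count_query` is hypothetical, and it assumes, as the condition `w_id>100` above suggests, that words are ranked by frequency starting at `w_id` 101, so that the top K words have the IDs 101 to 100+K.

```python
def ngram_count_query(k, first_id=101, n_values=(2, 3, 4, 5, 6)):
    """Build the counting query for the k most frequent words.

    Assumption: word IDs are frequency ranks starting at first_id.
    """
    cols = ", ".join(
        f"count(distinct left(word,{n})) as n{n}" for n in n_values
    )
    last_id = first_id + k - 1
    return (
        f"select {k}, {cols} from words "
        f"where w_id>={first_id} and w_id<={last_id};"
    )


# One query per table row: K = 10^2 ... 10^6.
queries = [ngram_count_query(10 ** n) for n in range(2, 7)]
for q in queries:
    print(q)
```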
Explain the unexpected straight lines in the diagram!
What is the right range to estimate the slope?
3.8.2 Number of letter-N-grams at word endings